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ELLIPTICAL GRAPHICAL MODELLING 


DANIEL VOGEL AND ROLAND FRIED 


Abstract. We propose elliptical graphical models based on conditional uncorrelatedness as a gen¬ 
eralization of Gaussian graphical models by letting the popnlation distribntion be elliptical instead 
of normal, allowing the fitting of data with arbitrarily heavy tails. We study the class of propor¬ 
tionally affine equivariant scatter estimators and show how they can be used to perform elliptical 
graphical modelling, leading to a new class of partial correlation estimators and analogues of the 
classical deviance test. General expressions for the asymptotic variance of partial correlation esti¬ 
mators, unconstrained and under decomposable models, are given, and the asymptotic chi square 
approximation of the pseudo-deviance test statistic is proved. The feasibility of our approach 
is demonstrated by a simulation study, using, among others, Tyler’s scatter estimator, which is 
distribution-free within the elliptical model. Our approach provides a robustification of Gauss¬ 
ian graphical modelling. The latter is likelihood-based and known to be very sensitive to model 
misspecification and outlying observations. 


1. Introduction and notation 


The statistical theory of undirected graphical models for continuous variables is usually based on 
the assumption of multivariate normality. In practice, data may deviate from the normal model in 
various ways. Outliers and heavy tails pose a problem of particular gravity: they frequently occur, 
and the normal likelihood methods, such as the sample covariance matrix, are very susceptible to 
them. Our objective is to deal with heavy-tailed data and to safeguard graphical modelling against 
the impact of faulty outliers. We restrict our attention to the case where we have only continuous 
variables and only undirected edges. Joint multivariate normality is often assumed in this situation, 
and the statistical methodology is called Gaussian graphical modelling. We propose the class of 
elliptical distributions as a more general model and call our approach elliptical graphical modelling. 

The lack of robustness of Gaussian graphical modelling has been noted by several authors. Four 


proposals of robust approaches to Gaussian graphical modelling are known to us: Becker (2005) 
and Gottard and Pacillo ( 2010| suggest replacing the sample covariance matrix by the reweighted 
minimum covariance determinant estimator. |Miyamura and Kano (2006) propose an alternative 
M-type estimation, and Finegold &: Drton (arXiv: 1009.3669) consider robustified versions of the 


graphical lasso by Friedman et al. (2008). 


This article delivers a systematic treatment of the plug-in approach used in the first two ref¬ 
erences. We show that the sample covariance matrix may be replaced by any affine equivariant, 
root-n-consistent estimator. As long as ellipticity can be assumed, the classical Gaussian graphical 
modelling tools can be employed with simple adjustments. Thus the data analyst is free to choose 
the appropriate estimator, delivering the degree of robustness necessary for the data situation at 
hand. In order to reduce the search space, graphical modelling is often restricted to decomposable 
graphical models, which allow better interpretability, cf. Whittaker (1990, Ghapter 12), but are 


also easier to handle mathematically. For conciseness we restrict our derivations to decomposable 
models. 

We close this section by introducing some mathematical notation. Depending on the context, the 
symbol ~ means distributed as or asymptotically equivalent. Finite index sets are denoted by small 
Greek letters. Subvectors and submatrices are referenced by subscripts, e.g. for a,/3 C {!,...,p} 
the |q;| X |/3| matrix is obtained from S by deleting all rows that are not in a and all columns 
that are not in /3. Similarly, the p x p matrix is obtained from S by putting all rows not 
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in a and all columns not in /3 to zero. We view this matrix operation as two operations performed 
sequentially: first {■)a,p extracting the submatrix and then writing it back on a blank matrix 
at the coordinates specified by a and /3. Of course, the latter is not well defined without the former, 
but this allows us to write for example. Subscripts have priority over superscripts, 

stands for Let =5^ and be the sets of all symmetric, respectively positive definite 

p X p matrices, and define as the diagonal matrix having the same diagonal as ^ G The 

Kronecker product A^Si B of two matrices A,B£ is defined as the p‘^ x matrix with entry 
0‘i,jbk,i at position {(i — l)p + k, {j — l)p + 1}. Let ei,..., Cp be the unit vectors in and Ip the 
p-vector consisting only of ones. Define the matrices: 

p p p ^ 

'^P = Y1 Oef ® e^ef, ^p = '^Y^ ® ^ 2 ’ 

2=1 2=1 j = l 

where /p 2 denotes the p^ xp^ identity matrix; Kp is also called the commutation matrix. Finally, let 
vec{A) be the p^-vector obtained by stacking the columns of A G RP^p from 
each other. More on these concepts and their properties can be found in 
( fT99^ . 

2. Elliptical graphical models 


left to right underneath 
Magnus and Neudecker 


We introduce elliptical graphical models in analogy to Gaussian graphical models. For details 
on the latter see Whittaker (1990), Cox and Wermuth (1996), Lauritzen (1996) or Edwards (2000). 

Consider the class S’p of all continuous, elliptical distributions on RP. A continuous distribution 
F on RP is said to be elliptical if it has a density / of the form 

(1) fix) = det(S’)"^/ 25 ({(x - p)'^S~^ix - p)} 

for some p G RP and symmetric, positive definite pxp matrix S. We call S the shape matrix of E, 
and denote the class of all continuous elliptical distributions on RP with the parameters p and S 
by (Fpip^ S). A continuous distribution on RP is called spherical if S is proportional to the identity 
matrix. The shape matrix S is unique only up to scale, that is, S’p{p, S) = ^pip, cS) for any c > 0. 
Several forms of standardization have been suggested in the literature. Paindaveine (2008) argues 
for det(5) = 1. For our considerations the standardization of S is irrelevant, and we understand 
the shape of an elliptical distribution as an equivalence class of positive definite random matrices 
being proportional to each other and call any matrix S satisfying a for a suitable function g a 
shape matrix of F. We likewise view its inverse K = S~^, which we call a pseudo concentration 
matrix of F. Furthermore let 


h:F^p+ 




i-D-V2 


and P = /i(5). The function h is invariant to scale changes, i.e., P is a uniquely defined parameter 
of E G £’p{p,S). The diagonal elements of P are equal to —1. If the second-order moments of 
A ~ E G S’pip,S) exist, then S = var(A) is proportional to S. Consequently, the element pij 
of E at position (i,j) is the partial correlation of Xi and Xj given the other components of X 
(Whittaker, 1990, Chapter 5). We call P the generalized partial correlation matrix of E and refer 
to it as partial correlation matrix for brevity. 

The qualitative information of P can be coded in an undirected graph G = {V, E), where V is the 
vertex set and E the edge set, in the following way: the variables Xi, ..., Xp are the vertices, and 
an edge is drawn between Xi and Xj if and only iipij / 0 ii,j = 1,... ,p]i j). The graph G thus 
obtained is called the generalized partial correlation graph of E. Formally we set V = {1,... ,p} 
and write the elements of E as unordered pairs {i,j} ihJ = Ij • • • 7^ j)- The global and the 

local Markov property with respect to any generalized partial correlation graph G are equivalent 
for any F ^ S’p without any moment assumptions ( [Vogel and Fried' 2010). 

Let [G) be the subset of S^p consisting of all positive definite matrices with zero entries at 
the positions specified by the graph G = (V, E), i.e.. 


K G y+{G) 


K G EL+, kij = 0 (f / j, {i,j} i E), 
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and define 

4(G) = {Fe 4(//,i^-^) IfieW, Ke S^+{G) } 

to be the elliptical graphical model induced by G. We call the model Sp{G) decomposable if G is 
decomposable, i.e., if it possesses no chordless cycle of length greater than three. For alternative 


characterizations and properties of decomposable graphs see e.g. Lauritzen (1996, Chapter 2). 


In the remainder of this section we discuss the interpretation of an absent edge in the partial 
correlation graph of F G 4- Let us assume that the second-order moments of X ~ F are finite. 
The partial uncorrelatedness of, say, Xi and X 2 given X 3 ,... ,Xp, i.e., pi ^2 = 0, is to be under¬ 
stood as linear independence of Xi and X 2 after the common linear effects of X 3 ,, Xp have 
been removed. A relation of similar type is conditional independence: roughly, Xi and X 2 are 
conditionally independent given A 3 ,..., Xp, if the conditional distribution of (Ai, A 2 ) is a product 
measure for almost all values of the conditioning variable (A 3 ,... ,Xp). In comparison to partial 
correlation we understand conditional independence as complete independence of Ai and A 2 after 
the removal of all common effects of A 3 ,..., Xp. 

Another related term is conditional uncorrelatedness: the conditional distribution of (Ai,A 2 ) 
given (A 3 ,... ,Xp) has correlation zero for almost all values of (A 3 ,... ,Xp). There is an impor¬ 
tant qualitative difference between partial and conditional correlation: the former is a real value, 
the latter a function of the conditioning variable. All marginal and conditional distributions of 


elliptical distributions are again elliptical ( 

Fang and Zhang 

1991 

1, Section 2.6). Hence partial un- 

correlatedness implies conditional uncorrelatedness ( 

Baba et al. 

2004 

), and pi ^2 = 0 means linear 


However, the only spherical distributions with independent margins are Gaussian distributions, 
cf. Bilodeau and Brenner (1999 p. 51). Thus contrary to Gaussian graphical models a missing 


edge in the partial correlation graph of an elliptical distribution can in general not be interpreted 
as conditional independence. It appears, that by going from the normal to the elliptical model, the 
gain in generality is paid by a loss in the strength of inference. But this loss is illusory. From a 
data modelling perspective the conditional independence interpretation of partial uncorrelatedness 
under normality is an assumption, not a conclusion. By modelling multivariate data by a joint 
Gaussian distribution one models the linear dependencies and assumes that there are no other 
than linear associations among the variables. By fitting an appropriate non-Gaussian model one 
may still model the linear dependencies and allow non-linear dependencies. Using semiparametric 
models embodies this idea: the aspects of interest, in our case linear dependencies, are modelled 
parametrically, whereas other aspects remain unspecified. 

Of course, non-normal data need not be elliptical. Any relevant data feature, such as non- 
linearities, anomalous values, etc., is of potential interest and should be analysed. If the data, 
say, contains strong quadratic interactions, models that incorporate them should be used, as it is 


described e.g. in Gox and Wermuth (1996, Section 2.10). We address primarily the situation where 
the essential structure of the data is captured by an ellipse, and the linear interactions are the 
prominent ones. In any case, a robust analysis of the linear effects, as proposed here, is a suitable 
starting point of any subsequent tests for potential non-linear effects. 


3. Unconstrained estimation 

An important initial step towards elliptical graphical modelling is the unconstrained estimation 
of P. Unconstrained, since we do not assume a graphical model to hold, not forcing any constraints 
on P. We will consider estimators of the type Pn = /i(4), where 4 is a suitable estimator of a 
multiple of S, therefore start by considering shape estimators Sn. 

Let Ai,... ,Xn be independent and identically distributed random vectors sampled from an el¬ 
liptical distribution F G S‘p{p, S). Depending on the context, A^ may denote the kth p-dimensional 
observation or the fcth component of the vector A. Furthermore let = (Ai,..., A„)^ be the 
nxp data matrix and 4 = 4(Xn) be a scatter estimator. The symbol 4 may have two meanings: 
a function on the sample space, or as abbreviation for 4(Xn), a random variable. We use the term 
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scatter estimator for any symmetric matrix-valued estimator that gives some information about 
the spread of the data. We call Sn affine pseudo-equivariant, if it satisfies 

(2) + 1,6'^) OC ASn{Xn)A^ 


for all b G IRP and full rank A G This is a generalization of the strict affine equivariance for 

scatter estimators, which is obtained if ([^ is satisfied with equality. We use this weaker condition 
since overall scale is irrelevant for partial correlations, and we want to include estimators which 
only estimate shape, but not scale, and do not satisfy strict affine equivariance. Examples are given 
in Section [ 6 l 


Tyler (1982) shows that, if a strictly affine equivariant scatter estimator is evaluated at an ellipti¬ 


cal distribution, its first two moments, if existent, have a common structure. If the proportionality 
factor in ([^ is not random, the same holds true for pseudo-equivariant scatter estimators. The 
following condition is therefore natural for affine pseudo-equivariant estimators at elliptical dis¬ 
tributions F, and many shape estimators have been shown to satisfy it under suitable additional 
conditions on F, see also the examples in Section 


Assumption 3.1. The estimator Sn converges in probability to r]S for some rj > 0, and there exist 
(Ti > 0 and CJ 2 > —2aifp such that 

n^/^vec(^ - rjS) Np2 {O, r/^lT5(cri, 0 - 2 )} 

in distribution as n ^ 00 , where Ws{(Xi,cr 2 ) = 2aiMp{S® S) +(T 2 vecS(vecS)^. The scalars ui and 
CJ 2 depend on the estimator the dimension p and the function g, but are constant with respect 
to the shape S. 


We have the following implication for the derived estimators Kn = and Pn = h{Sn)- 

then with K = S~^, 


3.1 


Proposition 3.2. If Sn satisfies Assumption 

(i) n^/^vec(An -rj~^K) Np 2 {O, ?7“^WXcri, 0 - 2 )} 

in distribution as n ^ 00 , where IEA'(cri, £ 12 ) = 2aiMp{K (g) K) + (T 2 vecA(vecA)’^, and 

(ii) n^/2 vec(Pn - P) ^ Np^ {O, 2f7ir(5)Mp(A ® K)r{Sf} 

in distribution as n —)• 00 with r(S') = P Mp{P ® Kfy^)Jp. 


An important aspect of Proposition |3.2| is that under ellipticity the asymptotic covariance ma¬ 
trices of partial correlation estimators Pn derived from affine equivariant shape estimators Sn are 
proportional to each other. 


4. Constrained estimation 


In this section we treat the estimation of P under a given graphical model ^p{G) specified 
by the graph G = {V,E), i.e., estimating P with zero-entries. A crude approach is to put the 
concerning elements in an unconstrained estimate Pn to zero, but this generally destroys the positive 
definiteness of the estimate. We define the function he ■ ^p —)• lPp{G) : A 1 —)■ Aq by 


( 3 ) 


= aij ({i, j} e E V i = j), 

(^G )i,j — 0 {{hi} ^ Pi i 3) 1 

where aij are the elements of A. A unique and positive definite solution Aq of ([^ exists for 
an y positive definite A. The positive definiteness of A is sufficient but not necessary. For details 
(1996, p. 133). Since we mainly deal with asymptotics, and shape estimators ^ are 


see 


Lauritzen 


usually almost surely positive definite at continuous distributions for sufficiently large re, we assume 
positive definiteness for simplicity’s sake. 

Let G = {V, E) be a decomposable graph with cliques 71 ,..., 7 c (c > 1), and define the sequence 
5i,..., (5c -1 of successive intersections by 


4 = (71 U • • • U 7 A:) n 7 fc+i (A: = 1,..., c - 1). 
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We assume that the ordering 71 , • • • ,7 a: is such that the cliques form a perfect sequence, i.e., for 
all /c = 1 ,..., c — 1 there is a j G { 1 ,..., A:} such that 5^ C ■jj. It is always possible to arrange the 
cliques of a decomposable graph in a perfect sequence (Lauritzen, 1996 , Prop. 2 . 17 ). For notational 
convenience we let 


afc = 


Ck = 


Ik (A: = 1,... ,c), 

Jk-c (A: = c + 1 ,... , 2 c - 1 ), 

Then ho (A) allows the following explicit formulation for decomposable G, 

-1 

(71 G ^+). 


(A: = c + 1 ,..., 2 c — 1 ). 


hciA) = = I Cfc 


We will use this representation of he to further analyse the properties of the estimators Sq = 
hei^), Kg = and Pg = h{SG) for a decomposable graph G. Using the notation Sg = hG{S), 
K^ = S^\Pg = h{SG) G W^P and 

nG(s) = Ea(^i;.T)'''’®K7.)"” e 

k=l 

we have the following result about the asymptotic distribution. It is not assumed that the true 
shape S hts the model G. 


3.1 


and G is decomposable, then 


Proposition 4.1. If Sn fulfils Assumption 
(i) vec{KG — 3 ~^Kg) —)• A^p2{0,CJ2)} in distribution 

as n —)• 00 with CJ2) = 2aiMpIlGiS){S (8) S)IlGiS) + a2vecKG{vecKG)"^, 

(a) n^'^^vec( 5 G — tiSg) iVp2{0,7^1^5(3(171, CJ2)} in distribution as n —)• 00 

with Wsa{cri,cr 2 ) = 2aiMp {Sg ® Sb) Qg{S){S (g) S)Qg{S) {Sg ^ Sg) + <T 2 vec5G'(vecSG')'^, 

(in) Yec{PG — Pg) -%2{0, Wp^{ai)} in distribution as n ^ 00, where 

Wpq((7i) = 2(7ir(5b)-^p^G('S')(>S' (g) S)IIg{S)II{Sg)'^ with r(-) as in Proposition's^ 

If the true shape S satisfies the graph G, the expressions for the asymptotic variances simplify. 
Corollary 4.2. If Sn satisfies Assumption 3.1 with S~^ G =5G*"(G) for a decomposable graph G, 


then the assertions of Proposition 4-1 o,re true with 

Wk-q ((71,(72) = 2cjiMpQg(S') + ct 2 veciF(vec7C)^, 

(a) WsQ{(yi,(T2) = 2aiMp{S ® S)IIg{S){S ® S) + <72 vec 5 (vecS')'^ and 
(in) IUpg(cti) = 2 fTir( 5 )MpOG( 5 )r(,S)^. 


5. Testing 

An essential tool of most model selection procedures is to test if a model under consideration fits 
the data and to compare the ht of two nested models. On the set lip = {{i,j) \ i,j = 1,... ,p} of 
the positions of a p x p matrix we declare a strict ordering -<p by 

{hj)Pp{k,l) (j - l)p + A < (/- l)p +A:, {i,j,k,l = l,...,p). 

For any subset Z = {zi, ..., Zq} C lip, where Zk = {ik,jk) (^ = Ij • • ■ j ?) and ^^i Ap • • • Xp Zq, dehne 
the matrix Qz G as follows: each line consists of exactly one entry 1 and zeros otherwise. 

The 1-entry in line k is in column {ik — l)p+jk. Thus Qzvec{A) picks the elements of A at positions 
specihed by Z in the order they appear in vec(A). For a graph G = {V, E) with V = {1,... ,p} let 

K{G) = {{i,j) I i,j = 1,... ,p; {i,j} ^ E- j < i} , 

i.e., the set D{G) gathers all sub-diagonal zero-positions that G enforces on a concentration matrix. 
Thus E G S’p{G) is equivalent to Qd{g) vecTC = 0. 
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Now let Go = {V, Eq) and Gi = {V,Ei) be two decomposable graphs with V as above and 
Eq C El, or equivalently, (§’p(Go) C S’p{Gi). For notational convenience let 

Qo = Qd{Go)^ Qi = Qd{Gi)^ Qo,1 = Qd(Go)\-D(Gi)) 
furthermore 

qo = \D{Go)\, = |L)(Gi)|, qo,i = Qo — Qi- 

An intuitive approach to testing Go against the broader model Gi is to reject Go in favour of Gi, 
if all entries at positions in D{Go) \ D{Gi) of an estimate Pg^ of P under Gi are close to zero. For 
example, a sum of suitably weighted squared entries of Pgi, such as r„(Go, Gi) below, is a possible 
test statistic. Let 


Rg{S) = T{S)MpnGiS)T{Sf. 

For invertible S the matrix Rgi{S) has rank (p— l)p/2 — qi, which can be deduced from the inverse 
function theorem. Then Qo^iRgj^{S)Qq i is of full rank, and the probability that the Wald-type 
test statistic 

Qri |Qo,i-Rgi(‘^)Qri| QoqvecPGi 


^ n / ^ \ R 

r„(Go,Gi) = - (vecPci 


5.1 


exists tends to 1 as n —>• oo. Proposition 
under the null hypothesis that Go is true, part 


describes the asymptotic behaviour of r„(Go,Gi) 
dll), and under a local alternative, part dlil). 


Proposition 5.1. Let Go, Gi be as above and Xi,... ,Xn independent and identically distributed 
random variables with Xi ^ F £ S’p{p,S) C S'p{ Go). Let ^ be an affine pseudo-equivariant scatter 
estimator such that ^(Kn) satisfies Assumption 


3.1 


(TiXgo 1 distribution as n 


oo. 


(4) 


(i) ThenTn{Go,Gi) 

(a) For m £ TN let = {x[^\ ..., X^'^)'^ be distributed as , thus X^'^ 

(^p{pi,Sm), where the sequence Sm is such that B = — S) exists. If, for 

each n £^, Sn is applied to Hh ', then, as n ^ oo, 

fr,{Go,Gi) ^ aixl^^{af^6{B,S)} 
in distribution, where 


5{B,S) — Qox {Qo,iRgi{S)Qop} 


V = r(5)flGi (S) vecB. 


We have some remarks. 

(a) We define the non-centrality parameter of the distribution Xri^) ~ ^r))^ as 5 p.. 

(b) We require ^ to be affine pseudo-equivariant to ensure that the convergence of — 

pSra} for n —>■ oo is uniform in m. 

(c) In part (|I^ of Proposition |5.1| we do not require the sequence of alternatives to lie in the model 
Gi, i.e., that S~^ £ lX^{Gi), as it is not necessary for the convergence Q to hold. When 
choosing a model by forward selection one usually compares two wrong models, so it is of 
interest to know the behaviour of T„(Go, Gi) also if Gi is not true. 

A difficulty with the test in Proposition 5.1 is the complicated formulation of Tn{Go,Gi). The 


classical test in Gaussian graphical models is the deviance test. The next proposition gives the 
analogue for elliptical graphical modelling. It treats parts 0 and Q of the previous proposition 
simultaneously. 


Proposition 5.2. Let Go, Gi be as above and ^ a sequence of almost surely positive definite 
random p x p matrices, for which — S) converges in distribution to a non-degenerate limit 

for some S £ with S~^ £ =5^+(Go). Then, as n ^ oo, 

L)n(Go,Gi) = n ^logdethco (Sn) -log det ho fiSn)j ~ T„(Go,Gi). 
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If the larger model Gi is the saturated model, then Proposition |5.2| is a corollary of Theorem 2 


m 


Tyler| (|1983|). We extend Tyler’s result to two nested models. 

remain true, ifTn{Go, Gi) is replaced 


Corollary 5.3. Both assertions and Q) of Proposition 
by DniGo,Gi). 


5.1 


6. Examples 


There are many affine equivariant, robust estimators, 


see, 


for example, Zuo (2006) or Maronna 


et al. (2006). The comparison of asymptotic properties of such estimators in the elliptical model 


reduces to a comparison of the respective values of the scalars ai and a 2 . Of course, the sample 
covariance matrix is affine equivariant. The following can be found in [Tyler (1982). 


Proposition 6.1. If Xi ,..., are independent and identieally distri bute d with distribution F G 
Sp{ii,S) and E||Xi — pl\\^ < oo, then Sn = En(Xn) fulfils Assumption 
02 = k/3, where k is the excess kurtosis of any component of Xi. 


3.1 


with oi = 1 + k/ 3 and 


Proposition |6.1| indicates the inappropriateness of the sample covariance matrix for heavy-tailed 
distributions: its asymptotic distribution depends on the kurtosis, which is large at heavy-tailed 
distributions, rendering the estimator inefficient. An alternative is Tyler’s M-estimator, which is 
defined as the solution Vn = Vn{^n) of 


pA iX,-Xn){Xi-Xn)^ 
n {Xi -Xn)^Vn\X, -Xn) 


that satisfies detE^ = 1. Existence, uniqueness and asymptotic properties are treated in Tyler 
(1987), where the following result is proven. 


Proposition 6.2. If Xi ,..., are independent and identieally distributed with distribution F G 
Sp{pL,S), furthermore E||Ali — p|p < oo and E||Ai — < oo, then Vn fulfils Assumption 

with cji = 1 -b 2/p and 02 = —2(1 -|- ‘Ijp^jp. 


3.1 


We have the following remarks. 

(a) In Proposition |6.1| the scalars oi and 02 are constant, irrespective of the function g, i.e., the Tyler 
matrix is asymptotically distribution-free within the elliptical model. Hence, when carrying out 
any of the tests from Section oi does not need to be estimated. 

(b) Tyler’s matrix can cope with arbitrarily heavy tails. The assumption of finite second moments 

is only required for location estimation by the mean. It may be replaced by any root-n- 
consistent location estimator, for instance the Hettmansperger-Randles (2002) median. The 
inverse moment condition E||Ai — < 00 is fairly mild: for p > 2 it is fulfilled if g has 

no singularity at 0. 

(c) The estimator Vn is affine pseudo-equivariant and gives information only about the shape but 


none about the scale. Other such estimators are Oja sign and rank covariance matrices (Ollila 


et ah, 2003, 2004). 


The popular reweighted minimum covariance determinant estimator (Rousseeuw and Leroy, 1987| 


Croux and Haesbroeck 


1999| ) is highly robust and affine equivariant and has previously been 
proposed in the context of graphical modelling ( jBecker , |2005 Gottard and Pacillo, 2010). It is 
defined as follows. A subset r C {l,...,n} of size h = [tu], where l/2<t<lis fixed, is 
determined such that det(S'’') is minimal with 




^(X, - X^)(X, - X^)‘ 










The mean pmcd and covariance matrix Smcd computed from this minimizing subsample are called 
the raw minimum covariance determinant location and scatter estimates. The scatter part is scaled 
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to achieve consistency for the covariance at the Gaussian distribution. Based on the raw estimates 
a reweighted scatter estimator Srmcd is computed from the whole sample: 

( n \ n 

Wi I ^ Wi {Xi — flucD ) {Xi — Amcd, 

i=l J i=l 

where tCj = 1 if {Xi — — /Imcd) < Xp,i-a zero otherwise, and a is a small 

rejection probability, e.g. a = 0 • 05. The reweighted covariance estimate is again scaled, but since 
this is not necessary for our applications we omit the details. 


7. Numerical example 


We present the results of a simulation study comparing several estimators. We repeatedly sample 
100 independent observations of a 5-dimensional distribution. We use the same shape matrix 
throughout, with equal diagonal elements and the partial correlation structure represented by the 
graph in Figure We let the tail behaviour vary, using the normal distribution and several 

members of the family to generate heavier tails (Bilodeau and Brenner, 1999, p. 207). The 

index u denotes the degrees of freedom. The moments of t^^p are finite up to order u — 1. We may 
talk of a fixed shape of the t^^p distribution, since g is specihed. For u > 3, its covariance matrix is 
u(u — 2)~^S, and, for u > 5, the excess kurtosis of each component is 6/(u — 4). Propositions 


6.1 


and |6.2| imply that the Tyler matrix is asymptotically more efficient than the sample covariance 
matrix at t^^p v < p + A. For each distribution considered we generate 2000 samples, compute 
the estimates described in Section and, based on each estimate, select a model. 

We use a simple one-step model selection procedure, that allows us to concentrate on the effects 
of the different estimators. For each pair {i, j} we test the model with all edges but {i,j} against the 
saturated model, and exclude the edge {i,j} if the test accepts the smaller model. The significance 
level a = 0 • 05 is an ad hoc choice. In our simulations the Wald-type test statistic Tn and the 
deviance test statistic Dn showed a very similar behaviour. Tables and report the results of the 
deviance test. 

The main criterion by which we measure the goodness of the model selection is the mean edge 
difference, i.e., the average number of edges that are wrongly specified in the selected model, 
whether an existing edge was rejected or an absent edge was included. Although less suited as a 
performance criterion it is also of interest to know, how often the true model is found. Any model 
selection procedure that is based on testing for zero parameters aims at controlling the probability 
of correctly specifying the non-edges. We may also look at how often a single non-edge is correctly 
specihed. This should be true in about 95% of the cases, since a sample size of 100 seems large 
enough to e:™ect some validity of the asymptotics in this setting. 

In Table [ij we compare the sample covariance matrix to Tyler’s estimator Vn with the 
Hettmansperger-Randles median as location estimator. The benchmark is traditional graphical 
modelling, i.e., the performance of at the normal distribution. The classical deviance test de¬ 
teriorates, if we move away from normality. We assume only ellipticity of the distribution and 
hence adjust the S^-based test statistic by an estimate of di, which is here the average of the 
sample kurtoses of all component divided by 3, cf. Proposition |6.1[ This repairs the test, to some 


extent even in the case of the ts-distribution, but does not necessarily give a better model selec¬ 
tion. The estimator is inefficient under heavy tails, resulting in a test with low power. As for 



Figure 1. Example model, edge labels indicate partial correlations 
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Table 1. One-step model selection based on S or 1/ 


distribution 

estimator 

mean 

edge 

difference 

% true 
model found 

% non-edges 
correctly 
found 

% ®/(D 
correctly 
found 

normal 

S 

1 

•40 

21 

79 

95 


t* 

1 

•41 

20 

77 

94 


V 

1 

• 65 

14 

78 

94 

^25 

t 

1 

•44 

20 

75 

93 


t* 

1 

•44 

19 

78 

94 


V 

1 

• 64 

14 

78 

94 

tl2 

t 

1 

• 51 

20 

71 

92 


E* 

1 

• 51 

18 

79 

94 


V 

1 

• 66 

13 

79 

94 

h 

t 

1 

• 65 

17 

64 

89 


t* 

1 

• 65 

15 

76 

93 


V 

1 

• 62 

13 

79 

94 

tb 

E 

1 

• 90 

14 

51 

84 


E* 

1 

• 87 

10 

74 

93 


V 

1 

• 63 

14 

78 

94 

^3 

S 

2 

•49 

8 

29 

72 


E* 

2 

• 28 

7 

71 

91 


V 

1 

• 65 

14 

78 

95 


* test statistic adjusted by estimated kurtosis 


Tyler’s estimator, we recognize the asymptotic properties: the x^-quantile fits, it outperforms Sn 
at ti/-distributions with p < 9, and it is distribution-free within the elliptical model. 

In Tablej^we examine if the same robustness against heavy-tailedness may be achieved by equally 
simple means using other robust estimators and, in particular, how the previous proposals of robust 
Gaussian graphical modelling, the reweighted minimum covariance determinant estimator and the 
Miyamura-Kano estimator, perform in this situation. Outlier-robust estimators interpret the bulk 
of the data as approximately normal and the observations in the tails as faulty outliers, that should 
be downweighted or rejected. Although there are some common aspects, this is in principle a 
different situation, and it is not surprising that both estimators do not meet the performance of 
Tyler’s estimator at heavy-tailed distributions. Also, we did not estimate cxi from the data, but 
used its value for the normal distribution. For the reweighted minimum covariance determinant the 


values can be found in Croux and Haesbroeck (1999). But even in the Gaussian case, when ui is 
chosen asymptotically correct, the asymptotic y^-distribution does not seem to provide a sensible 
approximation. This small-sample inefficiency of the reweighted minimum covariance determinant 
estimator is usually taken care of by multiplying the test statistic by a correction factor, which has to 


be determined numerically (Groux and Haesbroeck, 1999). Using such an appropriate finite-sample 


value of cJi repairs the test, but again, it does not improve the model selection in our example. 
For the Miyamura-Kano proposal we note that they devise an alternative way of constrained 
estimation, but propose a very slow algorithm, which makes it, at least in the R implementation 
we used, unfeasible in larger dimensions. There is a tuning parameter to choose, which was set to 
0 • 3 in our experiment, following the recommendation of the authors. All calculations were done 
in R 2 • 9 • 1, employing routines from the packages mvtnorm , ggm , ICSNP , rrcov and rggm . 


8. Conclusion 


As a very simple and efficient technique to safeguard graphical modelling of continuous data 
against the impact of heavy tails, non-normality in general and, to some degree, also faulty outliers 
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Table 2. One-step model selection based on robust estimators 


distribution 

estimator 

mean 

edge 

difference 

% true 
model found 

% non-edges 
correctly 
found 

% ®/® 
correctly 
found 

normal 

RMCD 0•5 

2 

• 05 

11 

54 

85 


RMCD 0 • 5** 

2 

• 06 

5 

81 

94 


RMCD 0 • 75 

1 

• 66 

15 

72 

92 


RMCD 0 • 75** 

1 

• 69 

13 

80 

94 


M-K+ 

1 

• 61 

14 

81 

95 

G 

RMCD 0•5 

2 

• 18 

9 

45 

82 


RMCD 0 • 5** 

2 

• 13 

5 

76 

93 


RMCD 0 • 75 

2 

• 02 

11 

51 

85 


RMCD 0 • 75** 

1 

• 96 

10 

61 

89 


M-K+ 

1 

• 82 

12 

67 

91 


with finite-sample correction; + Miyamura and Kano (20061 


we recommend the use of Tyler’s estimator in place of the empirical covariance matrix. The gain 
in robustness comes at a very moderate loss in efficiency, which becomes smaller with increasing 


dimension, and a justifiable increase in computing time. Vogel et al. (2010) report average com¬ 


puting times on a 2.83 GHz Intel Core2 CPU for n = 200 and p = 50 of less than a second for 
the Tyler matrix, compared to less than three seconds for the reweighted minimum covariance 



A problem that has not been addressed in this article is the accuracy of the asymptotic approxi¬ 
mations for small to moderate sample sizes, in particular, to what extent it depends upon the ratio 
p/n. This question splits into two parts. The first is an evaluation of the hnite-sample properties 
of the affine pseudo-equivariant scatter estimators. These may be very different and do not allow 
a unified treatment. Very little seems to be known theoretically, either on the exact distribution of 
most robust scatter estimators or the rate of convergence to the Gaussian limit. However, there is 
strong empirical evidence that Tyler’s estimator has excellent small-sample properties. In all our 
simulations the difference in the empirical distributions of any univariate function of the sample 
covariance matrix at normality and the Tyler matrix Vn at any elliptical distribution, is fully 


expressed by the asymptotic scaling factor 1 + 2/p, see also Vogel et al. (2010, Figure 2). Moreover, 
it is known that Vn behaves similarly to when p and n grow large simultaneously (Diimbgenj 


1998). The second task is then, given the small-sample properties of the estimators, to assess the 


accuracy of the asymptotic distributions of the tests. This question is of relevance also in classi¬ 
cal graphical modelling, where it has been noted that the deviance test statistic may substantially 
differ from its x^ limit for small re. Improved small-sample approximations have been proposed 
( jPorteous 1985, 1989), but also the exact distribution of the deviance test statistic is known for 
decomposable models, cf. Lauritzen (1996, Sections 5.2.2 and 5.3.3). Our simulations indicate that 


finite-sample correction techniques used in Gaussian graphical modelling may be put to good use 
also under ellipticity by applying it in an analogous way to Tyler’s estimator. 

The main limitation of the affine equivariant approach is that it does not provide a solution in the 
p > n situation or allow a simple transfer of standard techniques, like regularization, that are used 
in Gaussian graphical modelling. Any affine equivariant, robust estimator requires more than p + 1 
data points, because the only affine equivariant scatter estimator in the p -|- 1 > re situation is the 
sample covariance estimator (Tyler, 2010| . Dropping the affine equivariance property is inevitable 
for robust, high-dimensional graphical modelling. 
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Appendix A. Proofs 

The proofs repeatedly apply the delta method to functions mapping matrices to matrices. We 
dehne the derivative of such a function, say, g : IR^^^ —)• at point X as the derivative of 

vecg{X) with respect to vec(A) and denote its Jacobian at point X, which is of size x by 
©^(A). The symmetry of the argument poses a technical difficulty: there are p{p + l)/2 rather 
than p^ variables, and the function g must be viewed as a function from Rp(p+i)/ 2 to in 

order to define a derivative. To deal with this issue we compute the Jacobian of g interpreted as a 
function from RJ’^P to R^^^ and post-multiply it by Mp. This is justihed by the chain rule applied 
io g = g 20 gi, where gi duplicates the off-diagonal elements and 52 • R^^^ —> R^^^. The derivatives 
below contain the right-multi plied Mp depending on whether we view the function as defined on 
S^p or on The textbook Magnus and Neudecker (1999) covers most of the tools of the proofs. 


in particular calculation rules concerning the vec operator, the Kronecker product and derivatives 
of matrix functions. We repeatedly use the following without reference. 


(vecA)^ vecB = tr(A^i?), vec{ABC) = {C'^ (g) A) vecB, 
Mp{A (g) A)Mp = Mp{A (g) A) = (A (g) A)Mp, 

for matrices A,B,C,D G R^^^ (Magnus and Neudecker, 1999, pp. 28, 30, 31). Let i ■. A ^ A~^ 


(A ®B){C®D)=AC® BD, 

Mp = 


denote matrix inversion. Its Jacobian matrix is (Magnus and Neudecker, 1999, p. 184) 

Dr(A) = -{A^)-^®A-^. 

Proof of Proposition \3^ Part (|^ follows by straightforward calculations from the delta method. 

Part (^ii|: We have Pn = h{Kn) with h ■. A^ —Aj^^'^AAj^^'^. We need to compute the derivative 

'—' ~ — 1/2 

of h in order to apply the delta method. We start by considering ho : A ^ A^ ' . Its Jacobian 

matrix Dho(A) = — ^ (g) A)^^| Jp is obtained by elementwise differentiation. Applying the 

multiplication rule to h{A) = —ho{A)Aho{A) yields 

1-1/2 ^^-1/2 


(5) Dh(A) =-Mp|/i(A)® A^i} Jp - A)^ 


’-D 


By the delta method, 

vec ^Pn — P^ = vec| /i( A) — /i(r/-i A) | 

converges in distribution to a p^-dimensional normal distribution with mean zero and covariance 
matrix 


Bh{rj-^K)r]-^WK{ai,a2) {Dh(??-^A)}^, 


which reduces to the expression given in Proposition 3.2 In particular, a 2 vanishes, since Dh{K) vecA 
0. This is generally true for scale-invariant function h. □ 


Proof of Proposition 4-1. Part (^: Since Kq = hciS) with 


2c -1 


flG ■■ A^ 


(P) 


k=l 
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in 


we want to compute the derivative of he- Let ha '■ A ^ foi' subset a C {1,... ,p}. 

The mapping ha is a composition of {■)a,a, ^ and (•)^^^. We obtain by the chain rule 

BhaiA) = - , BhaiA) = - £ Cfc {{A-la,fY''^HAala, 

k=l 

Then r]~‘^WKa{(^i,(^ 2 ) = BhGiriS)r]‘^Wsicri,a 2 ) ^BhG{r]S)'^ is shown to have the form given i 

Proposition |4.l| 0 by noting that BhG{S) vecS = yqcKg- This holds true because 

( Q—l \ {p) Q { Q—^ \ 1^*1 _ ( Q— 1 1 Ip) 

\^OL,a) \‘^a,a) > 

which is a consequence of the inversion formula for partitioned matrices. 

Part (§: Applying the delta method we have to left- and right-multiply Wkq by the Jacobian 
of i evaluated at Kg- Note that {Sg ® Sg) veciLc = vec^c- 

Part (iii): We left- and right-multiply Wkq by tbe Jacobian of h, given in evaluated at 

Kg- □ 

Proof of Corollary \4.!^ Let S € be such that hG{S) = S and write short U for B.g{S). It 
suffices to show that 2MpB,{S ® S)B. = 2MpB,. Proposition |4.1| ([^ in connection with Proposition 
6.1 identifies the left-hand side as the asymptotic covariance of /ig(S), where S is the sample 


(p) 


covariance matrix, at the normal distribution with covariance S. Formula (5.50) in Lauritzen 
(1996) identifies the same quantity as the right-hand side. □ 


In the proofs of Proposition 5.1 and Corollary 5.3 we use the following lemma. 


(m) 


Lemma A.l. Let'K^ and 
that Sni^n) satisfies Assumption 3.1 Assume furthermore that there is a continuously differentiable 


m, n G M, be as in Proposition 5.1 and Sn a shape estimator such 


function ^—)■ R with f,{Ip) = 1 such that Sn satisfies 

( 6 ) Sn(KnA^ + Inb^) = 

for any data matrix G b G R^ and full rank matrix A G R^^^. Then 

n^/^vec| 4 (xb')) - t/ 5 | iVp2 {r/(Rc^), r?^lT5(cri, 0-2)} 

and c = D^(/p) vec( 5 -i/ 2 R 5 -V 2 ), 


in distribution as n ^ 00 , where B is as in Proposition 


5.1 


The proof of Lemma A.l follows by straightforward calculations and is omitted. The constant c 
is identihed by means of the first order Taylor expansion of ^{Sn'^S~^Sh^‘^) around Ip. 


from Sn or Si- Applying Lemma A.l to we deduce part (^ii^ for analogously to part (^iP 


Towards the proof of Proposition|5.2|we state Lemmas 


A.2 


to 


A.4 


For A G let /a : 


Proof of Proposition \5.1\ Part (|i| follows by standard arguments from the asymptotic normality 
of the estimator Pg^- For any affine pseudo-equivariant estimator Sn the rescaled estimator ^ = 
det(^)“^/P^ satisfies and t he va lue of the test statistic_T„(Go, Gi) is the same, if_computed 

□ 

R: 

I I I I t' “ ■ " t' 

fA{B) = logdetR -|- tr(R“^A). From the theory of Gaussian graphical models we know that for 
any graph G and A G the matrix Ag = hG{A) is the unique solution of the constrained 
optimization problem 

(7) minimize//I(R) subject to vec/i(R) = 0, B £ , 

because Ag is the maximum likelihood estimate of the covariance matrix under the model G at 
the normal distribution, if A is the observed sample covariance, cf. Lauritzen (1996, p. 133). Now 
with the notation of Section 0 let Hq{-) = Qd(Go) vec/i(-), Ri(-) = QD{Gi)^ech{-) and iLo,i(-) = 
Qd{Go)\d{Gi) vech(-). 
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Lemma A.2. Agp = /ig'q(A) is a solution of the constrained optimization problem 

(8) minimize fAQ^{hGi{C)} subject to (C*)} = 0, C G 

Proof. By Q and Q, Aqq uniquely solves the constrained optimization problem 

(9) minimize(i?) subject to Ho{B) = 0, B £ . 

The restriction Hq[B) = 0 is equivalent to Hi{B) = 0 A Hqa{B) = 0, and any matrix B with 
Hi{B) = 0 can be written as B = hc^iC) for some C G . Thus {S | Hq{B) = 0,5 G } 
and ^ = {5 = hciiC) | 5o,i{/igi(C*)} = 0, C G } are equal, and so are the solution sets of the 
constrained optimization problems Q and 

(10) minimize /A(3^(5) subject to 5 G 

Thus Ag^ uniquely solves (10), and all matrices C G dPp with hG^iC) = Ag^, among them Agq, 


solve 


□ 


The next two lemmas are stated without proof. Expressions (12) can be deduced from the proofs 
of Propositions 3.2 and 4.1 and 0 can be assembled from the derivatives given in [Magnus and| 


Neudecker (1999 pp. 178,179). 


Lemma A.3. Let H : —)> be continuously differentiable and Gq, Gi as in Section^ Let 

furthermore Sn be a sequence of almost surely positive definite random p x p matrices, for which 
— S) converges in distribution for some S G with S~^ G yjf{Go). Then for n —>■ oo 

{77(5go) - ^(^Gi)} ~ n^/2 j)h{Sgo) vec [Sgo - Sg,) . 

Lemma A.4. For A,B€ , 

(11) E)/h(5) = yqc{B - Af {B-^ ® B-^)Mp, 

(12) mG{B) = {hG{B)®hG{B)}VLG{B)Mp, = Qqa^{B) {B-^ ® B-^) Mp. 

Proof of Proposition The second order Taylor expansion of logdet(-) is 

logdet(A+A) = logdet A + {vec(A'^)“^}'^ vecA — ^ {vec(A'^)}^ {(A^)“^ (g) vecA + o(||A||^), 


cf. Magnus and Neudecker (1999, pp. 108, 179, 184). Applying this to the deviance test statistic 
yields 

Dri{Go,Gi) = n |logdet(5Go) - logdet(5Gi)| = -nlogdet 


= —ntr 


{Sg 


Q-l _ T 
1 ‘^Go 


n 

+ jtr 


(sg,s5,; 


-L 


+ o (^^II'SgiSg^ - 


(13) 


{vec (^Gi - 5 go) Y {Sal ® Sal) V®" Yg, - Sg, 


n 


oo. 


The asymptotic equivalence follows because 


(1) tr(5Gi4;-4) = 


I vec {-Sgi — *5go{| vec^g,^ = 0, which is a consequence of (3), and 


(2) n\\SG,Sal - IpW^ < (nV2||5^^ _ 5|| + nV2||5Go - 5||)'= Op(l), 
Applying Lemma A.3 to PI = /ig^ and using we find further 

Ygo - Sgi) ~ {Sgo ®-^Go) ^Gi('5go)^p vec {^Go - <Sgi 


n 


oo. 


vec 


and from (13) 


(14) 5„(Go,Gi) ~ ^{vec(5Gi-5Go)}^MpflGi(5Go)vec(5G,-5 go), 


n 


oo. 
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Next we introduce the Lagrange multiplier (Magnus and Neudecker 


1999 


p. 131). Since Sgq solves 
the constrained optimization problem ([^ with A = Sn, there exists a vector A G lR®’i such that 

ro {fsa^ o ho,) (5go) = A^D (Fo,i o he,) [Sgo) , 

which transforms to Mpflci(*§ 00 ) vec(<SGi — *§Go) = ^p^Gi{Sgo)^{Sgq)'^Qo cf. Lemma 

^ 1/2 ^ 1/2 

We left-multiply both sides by (8> Sq^ and solve for A. 


A.4 


MpnGi(<5Go) vec (^Sgi - Sgo 

= Mp^GiiSGoWiSGo)'^ Qo,l |Qo,1^Gi(-5go)Qri} Qo,ir(5Go)AfpnGi (‘^Gq) vec (^^Gi - Sgo^ ■ 

We substitute the right-hand side for the left-hand side of this equation in (14), apply again Lemma 
A.3 this time to H = iLo,i o ^Gi, which leads to 


vecLci ~ ^^'^^Qo,ir(5Go)MpnGi(*5Go) vec ^5 gi - 5go) , 

and obtain 

DniGo,Gi) ~ - ^vecPGi) |Qo,i-Rgi(* 500)13^1} Qo^vecLbu 
Finally Rgi{Sgo) ~ Rgi{^) as n —)■ 00 , since both sides converge to Rgi{S). 


n 


00 . 


□ 


Proof of Corollary \5.3[ Part (1) is straightforward. For p art (2 ) we take, as in Proposition 5.1 
the detour via Sp = det(<S„)“^/P^ and make use of Lemma A.l to ensure that meets the 

assumptions of Proposition |5.2[ □ 
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